
Heart disease describes a range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm problems (arrhythmias) and heart defects you’re born with (congenital heart defects), among others.
The term heart disease is often used interchangeably with the term cardiovascular disease. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.
Heart disease is among the leading causes of morbidity and mortality worldwide. Predicting cardiovascular disease is regarded as one of the most important problems in clinical data analysis. The healthcare industry generates a huge amount of data, and data mining turns this raw healthcare data into information that supports informed decisions and predictions.
According to a news article, heart disease is the leading cause of death for both women and men. The article states the following:
The World Health Organization (WHO) lists cardiovascular diseases as the leading cause of death globally with 17.9 million people dying every year. The risk of heart disease increases due to harmful behavior that leads to overweight and obesity, hypertension, hyperglycemia, and high cholesterol. Furthermore, the American Heart Association complements symptoms with weight gain (1–2 kg per day), sleep problems, leg swelling, chronic cough and high heart rate. Diagnosis is a problem for practitioners due to the symptoms’ nature of being common to other conditions or confused with signs of aging.
This makes heart disease a major concern. It is difficult to identify, however, because of several contributory risk factors such as diabetes, high blood pressure, high cholesterol and an abnormal pulse rate. Because of these difficulties, scientists have turned towards modern approaches such as Data Mining and Machine Learning for predicting the disease.
The dataset consists of data for 303 individuals. There are 14 columns in the dataset, which are described below.
1. Age : displays the age of the individual.
2. Sex : displays the gender of the individual using the following format :
1 = male
0 = female
3. Chest-pain type : displays the type of chest-pain experienced by the individual using the following format :
1 = typical angina
2 = atypical angina
3 = non-anginal pain
4 = asymptomatic
4. Resting Blood Pressure : displays the resting blood pressure of the individual (in mmHg)
5. Serum Cholesterol : displays the serum cholesterol level (in mg/dl)
6. Fasting Blood Sugar : compares the fasting blood sugar value of an individual with 120mg/dl.
If fasting blood sugar > 120mg/dl then : 1 (true)
else : 0 (false)
7. Resting ECG : displays resting electrocardiographic results
0 = normal
1 = having ST-T wave abnormality
2 = left ventricular hypertrophy
8. Max heart rate achieved : displays the max heart rate achieved by an individual.
9. Exercise induced angina :
1 = yes
0 = no
10. ST depression induced by exercise relative to rest : displays a numeric value (integer or float).
11. Peak exercise ST segment :
1 = upsloping
2 = flat
3 = downsloping
12. Number of major vessels (0–3) colored by fluoroscopy : displays an integer value. (In the dataset version used below this column also takes the value 4, as the summary statistics show; that value is commonly treated as a missing reading.)
13. Thal : displays thalassemia status :
3 = normal
6 = fixed defect
7 = reversible defect
(Note : the Kaggle version of the dataset used below recodes this column to the values 0–3, as the summary statistics confirm.)
14. Diagnosis of heart disease : displays whether the individual is suffering from heart disease or not :
0 = absence
1, 2, 3, 4 = present
(Note : in the Kaggle version used below, the target has already been binarised to 0 = absence, 1 = presence.)
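Because every one of these fields is stored as an integer code, it helps to keep a small decoding map at hand when reading the plots and tables later. The sketch below is purely illustrative; the dictionary and function names are my own choices, not part of the dataset.

```python
# Illustrative lookup tables for the coded columns (names are my own choice).
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
SEX_LABELS = {1: "male", 0: "female"}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

def describe_patient(sex, cp, slope):
    """Return a human-readable summary of three coded fields."""
    return f"{SEX_LABELS[sex]}, {CP_LABELS[cp]}, {SLOPE_LABELS[slope]} ST segment"

print(describe_patient(1, 2, 2))  # male, atypical angina, flat ST segment
```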
The original dataset contains 76 features, but for this study we chose only the 14 above, for the following reasons :
1. Age : Age is the most important risk factor in developing cardiovascular or heart diseases, with approximately a tripling of risk with each decade of life. Coronary fatty streaks can begin to form in adolescence. It is estimated that 82 percent of people who die of coronary heart disease are 65 and older. Simultaneously, the risk of stroke doubles every decade after age 55.
2. Sex : Men are at greater risk of heart disease than pre-menopausal women. Once past menopause, it has been argued that a woman’s risk is similar to a man’s although more recent data from the WHO and UN disputes this. If a female has diabetes, she is more likely to develop heart disease than a male with diabetes.
3. Angina (Chest Pain) : Angina is chest pain or discomfort caused when your heart muscle doesn’t get enough oxygen-rich blood. It may feel like pressure or squeezing in your chest. The discomfort also can occur in your shoulders, arms, neck, jaw, or back. Angina pain may even feel like indigestion.
4. Resting Blood Pressure : Over time, high blood pressure can damage arteries that feed your heart. High blood pressure that occurs with other conditions, such as obesity, high cholesterol or diabetes, increases your risk even more.
5. Serum Cholesterol : A high level of low-density lipoprotein (LDL) cholesterol (the “bad” cholesterol) is most likely to narrow arteries. A high level of triglycerides, a type of blood fat related to your diet, also ups your risk of a heart attack. However, a high level of high-density lipoprotein (HDL) cholesterol (the “good” cholesterol) lowers your risk of a heart attack.
6. Fasting Blood Sugar : Not producing enough of a hormone secreted by your pancreas (insulin) or not responding to insulin properly causes your body’s blood sugar levels to rise, increasing your risk of a heart attack.
7. Resting ECG : For people at low risk of cardiovascular disease, the USPSTF concludes with moderate certainty that the potential harms of screening with resting or exercise ECG equal or exceed the potential benefits. For people at intermediate to high risk, current evidence is insufficient to assess the balance of benefits and harms of screening.
8. Max heart rate achieved : The increase in cardiovascular risk, associated with the acceleration of heart rate, was comparable to the increase in risk observed with high blood pressure. It has been shown that an increase in heart rate by 10 beats per minute was associated with an increase in the risk of cardiac death by at least 20%, and this increase in the risk is similar to the one observed with an increase in systolic blood pressure by 10 mm Hg.
9. Exercise induced angina : The pain or discomfort associated with angina usually feels tight, gripping or squeezing, and can vary from mild to severe. Angina is usually felt in the center of your chest but may spread to either or both of your shoulders, or your back, neck, jaw or arm. It can even be felt in your hands. Types of angina: (a) stable angina (angina pectoris), (b) unstable angina, (c) variant (Prinzmetal) angina, (d) microvascular angina.
10. Peak exercise ST segment : A treadmill ECG stress test is considered abnormal when there is a horizontal or down-sloping ST-segment depression ≥ 1 mm at 60–80 ms after the J point. Exercise ECGs with up-sloping ST-segment depressions are typically reported as an ‘equivocal’ test. In general, the occurrence of horizontal or down-sloping ST-segment depression at a lower workload (calculated in METs) or heart rate indicates a worse prognosis and higher likelihood of multi-vessel disease. The duration of ST-segment depression is also important, as prolonged recovery after peak stress is consistent with a positive treadmill ECG stress test. Another finding that is highly indicative of significant CAD is the occurrence of ST-segment elevation > 1 mm (often suggesting transmural ischemia); these patients are frequently referred urgently for coronary angiography.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import time
import plotly.graph_objs as go
import plotly.offline as py
import warnings
warnings.filterwarnings('ignore')
heart = pd.read_csv("heart.csv")
heart.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
heart.shape
(303, 14)
heart.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
heart.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 303.0 | 54.366337 | 9.082101 | 29.0 | 47.5 | 55.0 | 61.0 | 77.0 |
| sex | 303.0 | 0.683168 | 0.466011 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| cp | 303.0 | 0.966997 | 1.032052 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| trestbps | 303.0 | 131.623762 | 17.538143 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 303.0 | 246.264026 | 51.830751 | 126.0 | 211.0 | 240.0 | 274.5 | 564.0 |
| fbs | 303.0 | 0.148515 | 0.356198 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| restecg | 303.0 | 0.528053 | 0.525860 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| thalach | 303.0 | 149.646865 | 22.905161 | 71.0 | 133.5 | 153.0 | 166.0 | 202.0 |
| exang | 303.0 | 0.326733 | 0.469794 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| oldpeak | 303.0 | 1.039604 | 1.161075 | 0.0 | 0.0 | 0.8 | 1.6 | 6.2 |
| slope | 303.0 | 1.399340 | 0.616226 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| ca | 303.0 | 0.729373 | 1.022606 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| thal | 303.0 | 2.313531 | 0.612277 | 0.0 | 2.0 | 2.0 | 3.0 | 3.0 |
| target | 303.0 | 0.544554 | 0.498835 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
heart.nunique()
age          41
sex           2
cp            4
trestbps     49
chol        152
fbs           2
restecg       3
thalach      91
exang         2
oldpeak      40
slope         3
ca            5
thal          4
target        2
dtype: int64
heart.isnull().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
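The null check above reports no missing values, yet the `nunique()` output shows `ca` taking 5 distinct values (0–4) and `thal` 4 (0–3), even though the codebook allows only 0–3 and 1–3 respectively. In the Kaggle version of this dataset, `ca == 4` and `thal == 0` are widely treated as unknown readings encoded as ordinary integers. A minimal sketch of flagging such sentinel codes, on a tiny stand-in frame (the treatment of these codes as missing is an assumption about the encoding, not something the csv documents itself):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real frame: ca should lie in 0-3 and thal in 1-3,
# so ca == 4 and thal == 0 are treated here as unknown readings (an assumed
# convention for the Kaggle encoding of this dataset).
df = pd.DataFrame({"ca":   [0, 2, 4, 1, 4],
                   "thal": [2, 0, 3, 1, 2]})

df["ca"] = df["ca"].replace(4, np.nan)
df["thal"] = df["thal"].replace(0, np.nan)

print(df.isnull().sum())   # ca: 2 missing, thal: 1 missing
```

After such a replacement, `isnull()` reflects the true amount of usable data instead of silently counting sentinel codes as valid measurements.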
heart.hist(bins = 45, figsize = (15,10))
plt.show()
fig, ax = plt.subplots(ncols = 5, nrows = 3, figsize = (15,8))
index = 0
ax = ax.flatten()
for col, value in heart.items():
    sns.boxplot(y = col, data = heart, ax = ax[index])
    index += 1
plt.tight_layout(pad = 0.5, w_pad = 0.7, h_pad = 5.0)
fig, ax = plt.subplots(ncols = 5, nrows = 3, figsize = (15,8))
index = 0
ax = ax.flatten()
for col, value in heart.items():
    sns.distplot(value, ax = ax[index])
    index += 1
plt.tight_layout(pad = 0.5, w_pad = 0.7, h_pad = 5.0)
heart['age'].value_counts()
58    19
57    17
54    16
59    14
52    13
51    12
62    11
60    11
44    11
56    11
41    10
64    10
63     9
67     9
65     8
55     8
61     8
53     8
45     8
43     8
42     8
50     7
66     7
48     7
46     7
49     5
47     5
70     4
39     4
68     4
35     4
69     3
40     3
38     3
71     3
37     2
34     2
76     1
29     1
74     1
77     1
Name: age, dtype: int64
heart['sex'].value_counts()
1    207
0     96
Name: sex, dtype: int64
sns.set_style("whitegrid")
plt.figure(figsize = (15,6))
sns.countplot(x = 'sex', data = heart)
<AxesSubplot:xlabel='sex', ylabel='count'>
heart['cp'].value_counts()
0    143
2     87
1     50
3     23
Name: cp, dtype: int64
plt.figure(figsize = (15,5))
sns.countplot(x = 'cp', data = heart)
<AxesSubplot:xlabel='cp', ylabel='count'>
heart['target'].value_counts()
1    165
0    138
Name: target, dtype: int64
plt.figure(figsize = (15,5))
sns.countplot(x = 'target', data = heart)
<AxesSubplot:xlabel='target', ylabel='count'>
temp = heart['target'].value_counts()
labels = temp.index
sizes = (temp / temp.sum())*100
trace = go.Pie(labels = labels, values = sizes, hoverinfo = 'label+percent')
layout = go.Layout(title = 'Diagnosed with Heart Disease %')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = "Diagnosed with Heart Disease")
plt.figure(figsize = (18,5))
sns.countplot(x = 'age', hue = 'target', data = heart)
<AxesSubplot:xlabel='age', ylabel='count'>
plt.figure(figsize = (18,5))
sns.distplot(heart['age'][heart['target'] == 1], hist = False, color = 'red')
sns.distplot(heart['age'][heart['target'] == 0], hist = False, color = 'blue')
plt.xlabel('Age ', fontsize = 16)
plt.title('Distribution Plot of Age with Target', fontsize = 16)
Text(0.5, 1.0, 'Distribution Plot of Age with Target')
plt.figure(figsize = (18,5))
sns.countplot(x = 'sex', hue = 'target', data = heart)
<AxesSubplot:xlabel='sex', ylabel='count'>
heart_sex = heart.groupby(["sex", "target"]).size()
heart_sex
sex target
0 0 24
1 72
1 0 114
1 93
dtype: int64
plt.pie(heart_sex.values, labels = ["Female not diagnosed Heart disease", "Female Diagnosed Heart disease",
"Male not diagnosed Heart disease", "Male Diagnosed Heart disease"], autopct = '%1.1f%%',
radius = 1.5, textprops = {"fontsize" : 16})
plt.show()
plt.figure(figsize = (18,5))
sns.countplot(x = 'cp', hue = 'target', data = heart)
<AxesSubplot:xlabel='cp', ylabel='count'>
plt.figure(figsize = (18,5))
sns.distplot(heart['chol'][heart['target'] == 1], hist = False, color = 'red')
sns.distplot(heart['chol'][heart['target'] == 0], hist = False, color = 'blue')
plt.xlabel('Serum Cholesterol ', fontsize = 16)
plt.title('Distribution Plot of Serum Cholesterol with Target', fontsize = 16)
Text(0.5, 1.0, 'Distribution Plot of Serum Cholesterol with Target')
plt.figure(figsize = (18,5))
sns.countplot(x = 'restecg', hue = 'target', data = heart)
<AxesSubplot:xlabel='restecg', ylabel='count'>
plt.figure(figsize = (18,10))
plt.title('Correlation of all the Columns', fontsize = 20)
sns.heatmap(heart.corr(), annot = True, vmin = -1, vmax = 1, center = 0, fmt = '.1g', linewidths = 1, linecolor = 'white',
square = True, cmap ='RdBu')
<AxesSubplot:title={'center':'Correlation of all the Columns'}>
heart.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
X = heart.drop(columns = ['target'])
y = heart['target']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(212, 13)
(212,)
(91, 13)
(91,)
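The split above shuffles at random; with only 303 rows it can be worth stratifying on the target so that both splits keep the same class balance. A minimal sketch on synthetic stand-in data (the array shapes and positive rate roughly mirror the real dataset; the data itself is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, ~55% positive, mirroring the real target balance.
rng = np.random.RandomState(42)
X = rng.randn(300, 13)
y = np.array([1] * 165 + [0] * 135)

x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Stratification keeps the positive rate nearly identical in both splits.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify=y`, an unlucky shuffle can leave one split noticeably richer in positives than the other, which distorts the accuracy comparison between train and test.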
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score
lr = LogisticRegression()
lr.fit(x_train, y_train)
LogisticRegression()
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_test, y_test))
Training accuracy : 0.8679245283018868
Testing accuracy : 0.8131868131868132
cm_test = confusion_matrix(y_test, y_pred_test)
print(cm_test)
[[32  9]
 [ 8 42]]
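The headline metrics can be read straight off these four cells. In scikit-learn's convention the rows are the true classes (0 then 1), so the matrix above gives 32 true negatives, 9 false positives, 8 false negatives and 42 true positives. The arithmetic below just re-derives the reported test accuracy from those counts:

```python
# Test-set confusion matrix cells from the logistic regression above.
tn, fp, fn, tp = 32, 9, 8, 42

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
sensitivity = tp / (tp + fn)                 # recall for the disease class
specificity = tn / (tn + fp)                 # recall for the healthy class

print(round(accuracy, 4), round(sensitivity, 2), round(specificity, 2))
# 0.8132 0.84 0.78 -- the accuracy matches the testing accuracy printed above
```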
print('Classification report for train data is : \n',
classification_report(y_train, y_pred_train))
print('\n')
print('Classification report for test data is : \n',
classification_report(y_test, y_pred_test))
Classification report for train data is :
precision recall f1-score support
0 0.90 0.80 0.85 97
1 0.85 0.92 0.88 115
accuracy 0.87 212
macro avg 0.87 0.86 0.87 212
weighted avg 0.87 0.87 0.87 212
Classification report for test data is :
precision recall f1-score support
0 0.80 0.78 0.79 41
1 0.82 0.84 0.83 50
accuracy 0.81 91
macro avg 0.81 0.81 0.81 91
weighted avg 0.81 0.81 0.81 91
from sklearn import metrics
print('Error rate for Train Data is : \n',)
print('Mean Square Error (MSE) :', metrics.mean_squared_error(y_train, y_pred_train))
print('Mean Absolute Error :', metrics.mean_absolute_error(y_train, y_pred_train))
print('Root mean Square Error (RMSE) :', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
print('\n')
print('Error rate for Test Data is : \n',)
print('Mean Square Error (MSE) :', metrics.mean_squared_error(y_test, y_pred_test))
print('Mean Absolute Error :', metrics.mean_absolute_error(y_test, y_pred_test))
print('Root mean Square Error (RMSE) :', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
Error rate for Train Data is :
Mean Square Error (MSE) : 0.1320754716981132
Mean Absolute Error : 0.1320754716981132
Root mean Square Error (RMSE) : 0.3634218921558155

Error rate for Test Data is :
Mean Square Error (MSE) : 0.18681318681318682
Mean Absolute Error : 0.18681318681318682
Root mean Square Error (RMSE) : 0.4322189107537832
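Notice that MSE and MAE come out identical in both blocks. That is not a coincidence: with hard 0/1 labels and hard 0/1 predictions, every per-sample error is either 0 or 1, squaring it changes nothing, and both metrics collapse to the plain misclassification rate. A tiny demonstration:

```python
import numpy as np

# With hard 0/1 predictions every per-sample error is 0 or 1,
# so |e| == e**2 and MSE == MAE == error rate.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

err = np.abs(y_true - y_pred)
print(err.mean(), (err ** 2).mean())   # both equal 2/6
```

This is why the train MSE above (0.1321) is exactly 1 minus the training accuracy (0.8679), and likewise for the test set.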
from sklearn.metrics import roc_curve
FPR_lr_train, TPR_lr_train, Thresholds_train = roc_curve(y_train, lr.predict_proba(x_train)[:,1])
fpr_series = pd.Series(FPR_lr_train)
tpr_series = pd.Series(TPR_lr_train)
thresholds_series = pd.Series(Thresholds_train)
FPR_lr_test, TPR_lr_test, Thresholds_test = roc_curve(y_test, lr.predict_proba(x_test)[:,1])
fpr_series = pd.Series(FPR_lr_test)
tpr_series = pd.Series(TPR_lr_test)
thresholds_series = pd.Series(Thresholds_test)
from sklearn.metrics import roc_curve
sns.set_style("whitegrid")
plt.figure(figsize = (20,8))
plt.plot(FPR_lr_train, TPR_lr_train, label = 'Train AUR Curve')
plt.plot(FPR_lr_test, TPR_lr_test, label = 'Test AUR Curve')
# Plot Base Rate ROC
plt.plot([0,1], [0,1], label = 'Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 20)
plt.ylabel('True Positive Rate', fontsize = 20)
plt.title('ROC Graph', fontsize = 20)
plt.legend(loc = "lower right")
plt.show()
from sklearn.metrics import auc
FPR_lr_train, TPR_lr_train, Thresholds_train = roc_curve(y_train, y_pred_train)
FPR_lr_test, TPR_lr_test, Thresholds_test = roc_curve(y_test, y_pred_test)
plt.figure(figsize = (20,8))
plt.grid()
plt.plot(FPR_lr_train, TPR_lr_train, label = " AUC TRAIN = "+str(auc(FPR_lr_train, TPR_lr_train)))
plt.plot(FPR_lr_test, TPR_lr_test, label = " AUC TEST = "+str(auc(FPR_lr_test, TPR_lr_test)))
plt.plot([0,1],[0,1],'g--')
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AUC(ROC curve)")
plt.grid(color = 'black', linestyle = '-', linewidth = 0.5)
plt.show()
import statsmodels.api as sm
logit_model = sm.Logit(y,X)
result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: 0.351932
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: target No. Observations: 303
Model: Logit Df Residuals: 290
Method: MLE Df Model: 12
Date: Mon, 13 Sep 2021 Pseudo R-squ.: 0.4893
Time: 17:16:22 Log-Likelihood: -106.64
converged: True LL-Null: -208.82
Covariance Type: nonrobust LLR p-value: 4.088e-37
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
age 0.0128 0.019 0.670 0.503 -0.025 0.050
sex -1.6381 0.452 -3.625 0.000 -2.524 -0.752
cp 0.8490 0.184 4.613 0.000 0.488 1.210
trestbps -0.0153 0.010 -1.562 0.118 -0.035 0.004
chol -0.0036 0.004 -0.960 0.337 -0.011 0.004
fbs -0.0115 0.526 -0.022 0.983 -1.042 1.019
restecg 0.5432 0.342 1.589 0.112 -0.127 1.213
thalach 0.0319 0.008 3.779 0.000 0.015 0.048
exang -0.8920 0.403 -2.215 0.027 -1.681 -0.103
oldpeak -0.4988 0.209 -2.381 0.017 -0.909 -0.088
slope 0.6092 0.346 1.761 0.078 -0.069 1.287
ca -0.7725 0.189 -4.080 0.000 -1.144 -0.401
thal -0.8438 0.287 -2.937 0.003 -1.407 -0.281
==============================================================================
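Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. A quick sketch using two of the coefficients printed above (sex and cp); note that the model above was fitted without an intercept term, so these figures should be read with some care:

```python
import numpy as np

# Coefficients from the Logit summary above (log-odds scale).
coefs = {"sex": -1.6381, "cp": 0.8490}

odds_ratios = {k: float(np.exp(v)) for k, v in coefs.items()}
print(odds_ratios)
# sex ~0.19: being male multiplies the odds of a positive diagnosis by ~0.19,
# holding the other regressors fixed; each step up in the cp code multiplies
# them by ~2.3
```

The direction is consistent with the earlier pie chart, where a larger share of the females in this sample carried a positive diagnosis.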
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
DecisionTreeClassifier()
y_pred_dtc_train = dtc.predict(x_train)
y_pred_dtc_test = dtc.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_dtc_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_dtc_test, y_test))
Training accuracy : 1.0
Testing accuracy : 0.7362637362637363
cm_test = confusion_matrix(y_test, y_pred_dtc_test)
print(cm_test)
[[33  8]
 [16 34]]
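The gap between 100% training accuracy and ~74% testing accuracy is classic overfitting: an unconstrained decision tree keeps splitting until it memorises the training set. Capping the depth is the simplest remedy. The sketch below uses synthetic stand-in data (not the heart dataset) to show the effect on training accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the heart data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

deep = DecisionTreeClassifier(random_state=42).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# The unconstrained tree memorises the training set; the depth-capped
# tree typically cannot, which usually narrows the train/test gap.
print(deep.score(X, y), shallow.score(X, y))
```

`min_samples_leaf` and `min_samples_split` serve the same purpose, and all three are tuned in the random forest grid search later in this post.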
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
RandomForestClassifier()
y_pred_rfc_train = rfc.predict(x_train)
y_pred_rfc_test = rfc.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_rfc_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_rfc_test, y_test))
Training accuracy : 1.0
Testing accuracy : 0.8241758241758241
cm_test = confusion_matrix(y_test, y_pred_rfc_test)
print(cm_test)
[[32  9]
 [ 7 43]]
from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train, y_train)
SVC()
y_pred_svm_train = svm.predict(x_train)
y_pred_svm_test = svm.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_svm_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_svm_test, y_test))
Training accuracy : 0.660377358490566
Testing accuracy : 0.7032967032967034
from sklearn.neighbors import KNeighborsClassifier
for i in range(1, 21):
    # refits for every k, but only the final model (k = 20) survives the loop
    neigh = KNeighborsClassifier(n_neighbors = i)
    neigh.fit(x_train, y_train)
KNeighborsClassifier(n_neighbors=20)
y_pred_knn_train = neigh.predict(x_train)
y_pred_knn_test = neigh.predict(x_test)
print("Train Accuracy : ", accuracy_score(y_pred_knn_train, y_train))
print("Test Accuracy : ", accuracy_score(y_pred_knn_test, y_test))
Train Accuracy :  0.6933962264150944
Test Accuracy :  0.6813186813186813
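The loop above refits the classifier for each k from 1 to 20 but keeps only the last model, so the accuracies reported are those of k = 20. To actually choose k, record a held-out score per k and keep the best. A self-contained sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Score every candidate k on the held-out split and keep the best one.
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr).score(x_te, y_te)
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The same selection is done more rigorously by the KNN grid search further below, which cross-validates instead of relying on a single split.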
from xgboost import XGBClassifier
gbm = XGBClassifier()
gbm.fit(x_train, y_train)
[17:16:23] WARNING: ..\src\learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
y_pred_xgb_train = gbm.predict(x_train)
y_pred_xgb_test = gbm.predict(x_test)
print("Train Accuracy : ", accuracy_score(y_pred_xgb_train, y_train))
print("Test Accuracy : ", accuracy_score(y_pred_xgb_test, y_test))
Train Accuracy :  1.0
Test Accuracy :  0.8021978021978022
logreg = LogisticRegression()
params = {'C': np.logspace(-1, 1, 50),
'penalty': ['None', 'l1', 'l2'],   # note: the string 'None' is not a valid sklearn penalty ('none' is), so those candidates fail silently
'solver': ['newton-cg', 'liblinear', 'sag', 'saga'],
'max_iter': [100, 150, 200, 500, 1000]
}
log_grid = GridSearchCV(estimator = logreg,
param_grid = params,
scoring = 'accuracy',
cv = 5,
n_jobs = -1)
log_grid.fit(x_train, y_train)
GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
param_grid={'C': array([ 0.1 , 0.10985411, 0.12067926, 0.13257114, 0.14563485,
0.15998587, 0.17575106, 0.19306977, 0.21209509, 0.23299518,
0.25595479, 0.28117687, 0.30888436, 0.33932218, 0.37275937,
0.40949151, 0.44984327, 0.49417134, 0.54286754, 0.59636233,
0.65512856, 0.71968567, 0.79060432, 0.86851137, 0.95409548,
1.04811313, 1.1513954 , 1.26485522, 1.38949549, 1.52641797,
1.67683294, 1.84206997, 2.02358965, 2.22299648, 2.44205309,
2.6826958 , 2.9470517 , 3.23745754, 3.55648031, 3.90693994,
4.29193426, 4.71486636, 5.17947468, 5.68986603, 6.25055193,
6.86648845, 7.54312006, 8.28642773, 9.10298178, 10. ]),
'max_iter': [100, 150, 200, 500, 1000],
'penalty': ['None', 'l1', 'l2'],
'solver': ['newton-cg', 'liblinear', 'sag', 'saga']},
scoring='accuracy')
print("Best parameters for the model :", log_grid.best_params_)
print("Best score for the model :", log_grid.best_score_)
Best parameters for the model : {'C': 0.281176869797423, 'max_iter': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best score for the model : 0.8299003322259135
y_pred_hyp_train = log_grid.predict(x_train)
y_pred_hyp_test = log_grid.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_hyp_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_hyp_test, y_test))
Training accuracy : 0.8537735849056604
Testing accuracy : 0.8241758241758241
from sklearn.model_selection import RandomizedSearchCV
log_grid_Rs = RandomizedSearchCV(LogisticRegression(),
param_distributions = params,
scoring = 'accuracy',
cv = 5,
verbose = True)
log_grid_Rs.fit(x_train, y_train)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
RandomizedSearchCV(cv=5, estimator=LogisticRegression(),
param_distributions={'C': array([ 0.1 , 0.10985411, 0.12067926, 0.13257114, 0.14563485,
0.15998587, 0.17575106, 0.19306977, 0.21209509, 0.23299518,
0.25595479, 0.28117687, 0.30888436, 0.33932218, 0.37275937,
0.40949151, 0.44984327, 0.49417134, 0.54286754, 0.59636233,
0.65512856, 0.71968567, 0.79060432, 0.86851137, 0.9540...
1.67683294, 1.84206997, 2.02358965, 2.22299648, 2.44205309,
2.6826958 , 2.9470517 , 3.23745754, 3.55648031, 3.90693994,
4.29193426, 4.71486636, 5.17947468, 5.68986603, 6.25055193,
6.86648845, 7.54312006, 8.28642773, 9.10298178, 10. ]),
'max_iter': [100, 150, 200, 500, 1000],
'penalty': ['None', 'l1', 'l2'],
'solver': ['newton-cg', 'liblinear',
'sag', 'saga']},
scoring='accuracy', verbose=True)
print("Best parameters for the model :", log_grid_Rs.best_params_)
print("Best score for the model :", log_grid_Rs.best_score_)
Best parameters for the model : {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 100, 'C': 5.689866029018296}
Best score for the model : 0.820376522702104
y_pred_hyp1_train = log_grid_Rs.predict(x_train)
y_pred_hyp1_test = log_grid_Rs.predict(x_test)
print("Training accuracy :", accuracy_score(y_pred_hyp1_train, y_train))
print("Testing accuracy :", accuracy_score(y_pred_hyp1_test, y_test))
Training accuracy : 0.8726415094339622
Testing accuracy : 0.8021978021978022
params = {
'n_estimators':[100, 200, 300],
'max_depth': [3, 5, 10],
'min_samples_split': np.arange(2,20,2),
'min_samples_leaf': np.arange(1,20,2),
'criterion': ["gini", "entropy"]
}
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
model = RandomForestClassifier()
grid_search = GridSearchCV(estimator = model,
param_grid = params,
cv = 5,
n_jobs = -1,
verbose = 1,
scoring = "accuracy")
grid_search.fit(x_train, y_train)
Fitting 5 folds for each of 1620 candidates, totalling 8100 fits
GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [3, 5, 10],
'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),
'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),
'n_estimators': [100, 200, 300]},
scoring='accuracy', verbose=1)
print("Best parameters for the model :", grid_search.best_params_)
print("Best score for the model :", grid_search.best_score_)
Best parameters for the model : {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 11, 'min_samples_split': 8, 'n_estimators': 100}
Best score for the model : 0.8535991140642304
train_rfc_pred = grid_search.predict(x_train)
test_rfc_pred = grid_search.predict(x_test)
print("Train Accuracy : ",accuracy_score(y_train, train_rfc_pred))
print("Test Accuracy : ",accuracy_score(y_test, test_rfc_pred))
Train Accuracy :  0.8820754716981132
Test Accuracy :  0.8461538461538461
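`cross_val_score` was imported near the top of this notebook but never used; it is a quick way to check that a tuned model's score is stable across folds rather than a lucky split. A sketch on synthetic stand-in data, borrowing the hyperparameters found by the grid search above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Hyperparameters borrowed from the grid-search result above.
rfc = RandomForestClassifier(max_depth=10, min_samples_leaf=11,
                             min_samples_split=8, n_estimators=100,
                             random_state=42)

# One accuracy per fold; a tight spread suggests the score is not split luck.
scores = cross_val_score(rfc, X, y, cv=5, scoring="accuracy")
print(scores.round(3), round(scores.mean(), 3))
```

A large spread between folds would be a warning that the single 70/30 split used elsewhere in this post may over- or under-state the model's accuracy.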
knn = KNeighborsClassifier()
params = {'n_neighbors':list(range(1,21,)),
'p':[1, 2, 3, 4],
'leaf_size':list(range(1,50,3)),
'weights':['uniform', 'distance']
}
knn_grid = GridSearchCV(estimator = knn,
param_grid = params,
scoring = 'accuracy',
cv = 5,
n_jobs = -1)
knn_grid.fit(x_train, y_train)
GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'leaf_size': [1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31,
34, 37, 40, 43, 46, 49],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20],
'p': [1, 2, 3, 4],
'weights': ['uniform', 'distance']},
scoring='accuracy')
print("Best parameters for the model :", knn_grid.best_params_)
print("Best score for the model :", knn_grid.best_score_)
Best parameters for the model : {'leaf_size': 1, 'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}
Best score for the model : 0.68859357696567
y_pred_knn_train = knn_grid.predict(x_train)
y_pred_knn_test = knn_grid.predict(x_test)
print("Train Accuracy : ", accuracy_score(y_pred_knn_train, y_train))
print("Test Accuracy : ", accuracy_score(y_pred_knn_test, y_test))
Train Accuracy :  0.7830188679245284
Test Accuracy :  0.6703296703296703
model = SVC()
param = {'C': [0.01, 0.001, 0.0001, 0.1, 0.8, 0.9, 1 ,1.1 ,1.2 ,1.3 ,1.4],
'kernel':['linear', 'rbf'],
'gamma' :[1, 1.1, 1.2, 1.3, 1.4]
}
grid_svc = GridSearchCV(model,
param_grid = param,
scoring = 'accuracy',
cv = 4)
grid_svc.fit(x_train, y_train)
GridSearchCV(cv=4, estimator=SVC(),
param_grid={'C': [0.01, 0.001, 0.0001, 0.1, 0.8, 0.9, 1, 1.1, 1.2,
1.3, 1.4],
'gamma': [1, 1.1, 1.2, 1.3, 1.4],
'kernel': ['linear', 'rbf']},
scoring='accuracy')
print("Best parameters for the model :", grid_svc.best_params_)
print("Best score for the model :", grid_svc.best_score_)
Best parameters for the model : {'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
Best score for the model : 0.8018867924528303
train_svm_hyp_pred = grid_svc.predict(x_train)
test_svm_hyp_pred = grid_svc.predict(x_test)
print("Train Accuracy : ",accuracy_score(y_train, train_svm_hyp_pred))
print("Test Accuracy : ",accuracy_score(y_test, test_svm_hyp_pred))
Train Accuracy :  0.8726415094339622
Test Accuracy :  0.8131868131868132
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(x_train, y_train)
GaussianNB()
y_pred_gnb_train = clf.predict(x_train)
y_pred_gnb_test = clf.predict(x_test)
print("Train Accuracy : ", accuracy_score(y_pred_gnb_train, y_train))
print("Test Accuracy : ", accuracy_score(y_pred_gnb_test, y_test))
Train Accuracy :  0.8301886792452831
Test Accuracy :  0.8351648351648352
from sklearn.ensemble import AdaBoostClassifier
adc = AdaBoostClassifier()
adc.fit(x_train, y_train)
AdaBoostClassifier()
y_pred_abc_train = adc.predict(x_train)
y_pred_abc_test = adc.predict(x_test)
print("Train Accuracy : ", accuracy_score(y_pred_abc_train, y_train))
print("Test Accuracy : ", accuracy_score(y_pred_abc_test, y_test))
Train Accuracy :  0.9245283018867925
Test Accuracy :  0.8021978021978022
importance = grid_search.best_estimator_.feature_importances_
feature_imp = pd.Series(importance, index = heart.columns[:13]).sort_values(ascending = False)
feature_imp
plt.figure(figsize = (15, 8))
sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
FPR_rfc_train, TPR_rfc_train, Thresholds_train = roc_curve(y_train, grid_search.predict_proba(x_train)[:,1])
fpr_series = pd.Series(FPR_rfc_train)
tpr_series = pd.Series(TPR_rfc_train)
thresholds_series = pd.Series(Thresholds_train)
FPR_rfc_test, TPR_rfc_test, Thresholds_test = roc_curve(y_test, grid_search.predict_proba(x_test)[:,1])
fpr_series = pd.Series(FPR_rfc_test)
tpr_series = pd.Series(TPR_rfc_test)
thresholds_series = pd.Series(Thresholds_test)
from sklearn.metrics import roc_curve
sns.set_style("whitegrid")
plt.figure(figsize = (20,8))
plt.plot(FPR_rfc_train, TPR_rfc_train, label = 'Train AUR Curve')
plt.plot(FPR_rfc_test, TPR_rfc_test, label = 'Test AUR Curve')
# Plot Base Rate ROC
plt.plot([0,1], [0,1], label = 'Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 20)
plt.ylabel('True Positive Rate', fontsize = 20)
plt.title('ROC Graph', fontsize = 20)
plt.legend(loc = "lower right")
plt.show()
FPR_rfc_train, TPR_rfc_train, Thresholds_train = roc_curve(y_train, train_rfc_pred)
FPR_rfc_test, TPR_rfc_test, Thresholds_test = roc_curve(y_test, test_rfc_pred)
plt.figure(figsize = (20,8))
plt.grid()
plt.plot(FPR_rfc_train, TPR_rfc_train, label = " AUC TRAIN = "+str(auc(FPR_rfc_train, TPR_rfc_train)))
plt.plot(FPR_rfc_test, TPR_rfc_test, label = " AUC TEST = "+str(auc(FPR_rfc_test, TPR_rfc_test)))
plt.plot([0,1],[0,1],'g--')
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AUC(ROC curve)")
plt.grid(color = 'black', linestyle = '-', linewidth = 0.5)
plt.show()
cm_test = confusion_matrix(y_test, test_rfc_pred)
print(cm_test)
[[32  9]
 [ 5 45]]
print('Classification report for train data is : \n',
classification_report(y_train, train_rfc_pred))
print('\n')
print('Classification report for test data is : \n',
classification_report(y_test, test_rfc_pred))
Classification report for train data is :
precision recall f1-score support
0 0.89 0.85 0.87 97
1 0.88 0.91 0.89 115
accuracy 0.88 212
macro avg 0.88 0.88 0.88 212
weighted avg 0.88 0.88 0.88 212
Classification report for test data is :
precision recall f1-score support
0 0.86 0.78 0.82 41
1 0.83 0.90 0.87 50
accuracy 0.85 91
macro avg 0.85 0.84 0.84 91
weighted avg 0.85 0.85 0.85 91
auc(FPR_rfc_train, TPR_rfc_train)
0.8792021515015689
#Plot ROC curve and calculate the AUC metric
from sklearn.metrics import plot_roc_curve
plot_roc_curve(grid_search, x_test, y_test)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1a56e7fa608>
tp, fn, fp, tn = confusion_matrix(y_test, test_rfc_pred, labels = [1,0]).ravel()
tp, tn, fp, fn
(45, 32, 9, 5)
precision_rate = tp / (tp + fp)
recall_rate = tp / (tp + fn)
print("The precision rate is : ", precision_rate)
print("The recall rate is : ", recall_rate)
The precision rate is :  0.8333333333333334
The recall rate is :  0.9
Cardiovascular disease has been the world's number one killer for many years, partly because of our limited knowledge of heart disease and of the lifestyle habits that contribute to it. The model and feature analysis above tell us which features we can keep track of through regular self-examination.
The most obvious symptom is chest pain. This dataset records four chest-pain types, and atypical angina shows the strongest association with heart disease here, but whichever type of chest pain you have, see a doctor just in case.
In addition, everyone should keep an eye on their resting blood pressure. The ideal resting blood pressure is below 120mmHg, and a reading far outside that range is a warning sign; once blood pressure climbs above 150mmHg, the problem is no longer confined to the heart alone.
There are plenty of electronic devices that can measure heart rate, so it is easy to monitor your own. Record your maximum heart rate to keep track of your heart's condition; if it drifts year after year, something may be wrong and is worth investigating.
No matter how healthy we are, we should have an annual examination, because the remaining features cannot be checked on our own. Finally, don't forget that the older we are, the higher the risk.